Voice-activity detection using energy ratios and periodicity

ABSTRACT

A voice activity detector (100) filters (204) out noise energy and then computes a high-frequency (2400 Hz to 4000 Hz) versus low-frequency (100 Hz to 2400 Hz) signal energy ratio (224), total voiceband (100 Hz to 4000 Hz) signal energy (214), and signal periodicity (208) on successive frames of signal samples. Signal periodicity is determined by estimating the pitch period (206) of the signal, determining a gain value of the signal over the pitch period as a function of the estimated pitch period, and estimating a periodicity of the signal over the pitch period as a function of the estimated pitch period and the gain value. Voice is detected (230-232) in a segment if either (a) the difference between the average high-frequency versus low-frequency signal energy ratio and the present segment's high-frequency versus low-frequency energy ratio either exceeds (310) a high threshold value or is exceeded (312) by a low threshold value, or (b) the average periodicity of the signal is lower (306) than a low threshold value, or (c) the difference between the average total signal energy and the present segment's total energy exceeds (304) a threshold value and the average periodicity of the signal is lower (304) than a high threshold value, or (d) the average total signal energy exceeds (412) a minimum average total signal energy by a threshold value and voice has been detected (410) in the preceding segment.

TECHNICAL FIELD

[0001] This invention relates to signal classification in general and to voice-activity detection in particular.

BACKGROUND OF THE INVENTION

[0002] Voice-activity detection (VAD) is used to detect a voice signal in a signal that has unknown characteristics. Numerous VAD devices are known in the art. They tend to follow a common paradigm comprising a pre-processing stage, a feature-extraction stage, a thresholds-comparison stage, and an output-decision stage.

[0003] The pre-processing stage places the input audio signal into a form that better facilitates feature extraction. The feature-extraction stage differs widely from algorithm to algorithm, but commonly used features include (1) energy, either full-band, multi-band, low-pass, or high-pass, (2) zero crossings, (3) the frequency-domain shape of the signal, (4) periodicity measures, and (5) statistics of the speech and background noise. The thresholds-comparison stage then uses the selected features and various thresholds of their values to determine if speech is present in or absent from the input audio signal. This usually involves use of some “hold-over” algorithm, or “on”-time minimum threshold, to ensure that a detected presence of speech lasts for at least a minimum period of time and does not oscillate on and off.

[0004] Some known VAD methods require an a priori measurement of the background noise in order to set the thresholds for later comparisons. These algorithms fail when the acoustic environment changes over time. Hence, these algorithms are not particularly robust. Other known VAD methods are automatic and do not require a priori measurement of background noise. These tend to work better in changing acoustic environments. However, they can fail when background noise has a large energy and/or the characteristics of the noise are similar to those of speech. (For example, the G.729 VAD algorithm incorrectly generates “speech detected” output when the input audio signal is a keyboard sound.) Hence, these algorithms are not particularly robust either.

SUMMARY OF THE INVENTION

[0005] This invention is directed to solving these and other problems and disadvantages of the prior art. Generally, according to the invention, voice activity detection uses a ratio of high-frequency signal energy to low-frequency signal energy to detect voice. The advantage of using this measure is that it can distinguish between speech and keyboard sounds better than simply using high-frequency energy or low-frequency energy alone. Preferably, voice activity detection further uses a periodicity measure of the signal. While a periodicity measure has been used in speech codecs for pitch-period estimation and voiced/unvoiced classification, it is used here to distinguish between speech and background noise. Also preferably, voice activity detection further uses total signal energy to detect voice. Significantly, however, no initial decision about detection is based on the total energy level alone. This makes the detection less susceptible to non-speech changes in the acoustic environment, for example, to volume changes or to loud non-speech sounds such as keyboard sounds. Furthermore, this makes it possible to use the detection for very low-energy speech, which in turn makes the detection more robust in situations where a poor-quality microphone is used or where the microphone recording level is low.

[0006] Specifically according to the invention, voice activity detection involves determining a difference between (a) an average ratio of energy above a first threshold frequency in a signal (illustratively the signal energy between about 2400 Hz and about 4000 Hz) and energy below the first threshold frequency in the signal (illustratively the signal energy between about 100 Hz and 2400 Hz), and (b) a present ratio of the energy above the first threshold frequency in the signal and energy below the first threshold frequency in the signal, and indicating that the signal includes a voice signal if the difference is either exceeded by a first threshold value or exceeds a second threshold value that is greater than the first threshold value. Preferably, the noise energy (illustratively, energy in the signal below about 100 Hz) is removed from the signal prior to the determining, so as to eliminate effects of noise energy on voice activity detection.

[0007] Preferably, the voice activity detection further involves determining the average periodicity of the signal, and indicating that the signal includes a voice signal if the average periodicity is lower than a third threshold value. Illustratively, determining the average periodicity involves estimating a pitch period of the signal, determining a gain value of the signal over the pitch period as a function of the estimated pitch period, and estimating a periodicity of the signal over the pitch period as a function of the estimated pitch period and the gain value.

[0008] Further preferably, the voice activity detection further involves determining a difference between an average total energy in the signal (illustratively the total energy in the voiceband from about 100 Hz to about 4000 Hz) and present total energy in the signal, and indicating that the signal includes a voice signal if the difference between the average total energy and the present total energy exceeds a fourth threshold value and the average periodicity of the signal is lower than a fifth threshold value.

[0009] Further preferably, the voice activity detection is performed on successive segments of the signal, illustratively on each 80 samples of the signal taken at a rate of 8 kHz. If there is not an indication that voice has been detected in the present segment but there is an indication that voice has been detected in the preceding segment, a determination is made of whether the average total energy of the signal exceeds a minimum average total energy of the signal by a sixth threshold value. If so, an indication is made that a voice signal has been detected in the present segment of the signal.

[0010] While the invention has been characterized in terms of method steps, it also encompasses apparatus that performs the method steps. The apparatus preferably includes an effecter (any entity that effects the corresponding step, unlike a means) for each step. The invention further encompasses any computer-readable medium containing instructions which, when executed in a computer, cause the computer to perform the method steps.

[0011] These and other features and advantages of the present invention will become more apparent from the following description of an illustrative embodiment of the invention considered together with the drawing.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 is a block diagram of a communications apparatus that includes an illustrative implementation of the invention;

[0013] FIG. 2 is a block diagram of a voice-activity detector (VAD) of the apparatus of FIG. 1;

[0014] FIG. 3 is a functional block diagram of a thresholds comparison block of the VAD of FIG. 2; and

[0015] FIG. 4 is a functional block diagram of an output decision block of the VAD of FIG. 2.

DETAILED DESCRIPTION

[0016] FIG. 1 shows a communications apparatus. It comprises a user terminal 101 that is connected to a communications link 106. Terminal 101 and link 106 may be either wired or wireless. Illustratively, terminal 101 is a voice-enabled personal computer and VoIP link 106 is a local area network (LAN). Terminal 101 is equipped with a microphone 102 and speaker 103. Devices 102 and 103 can take many forms, such as a telephone handset, a telephone headset, and/or a speakerphone. Terminal 101 receives an analog input signal from microphone 102, samples, digitizes, and packetizes it, and transmits the packets on LAN 106. This process is reversed for input from LAN 106 to speaker 103. Terminal 101 is equipped with a voice-activity detector (VAD) 100. VAD 100 is used to detect a voice signal received from microphone 102 in order to, for example, implement silence suppression and to determine half-duplex transitions.

[0017] According to the invention, an illustrative embodiment of VAD 100 takes the form shown in FIG. 2. VAD 100 may be implemented in dedicated hardware such as an integrated circuit, in general-purpose hardware such as a digital-signal processor, or in software stored in a memory 107 of terminal 101 or some other computer-readable medium and executed on a processor 108 of terminal 101. Illustratively, the analog output of microphone 102 is sampled at a rate of 8,000 samples/sec and digitized by terminal 101. VAD 100 receives a stream 200 of the digitized signal samples and performs serial-to-parallel (S-P) conversion 202 thereon by buffering the samples into frames of N samples, where N is illustratively 80. The frames are then passed through a high-pass filter 204 to remove therefrom noise caused by the equipment in use or the background environment. Filter 204 is illustratively a 10th-order infinite impulse response (IIR) filter with a cut-off frequency around 100 Hz. The filtered frames are then distributed to components of a feature-extraction stage for computation of the following parameters: periodicity, total voiceband energy, and a high-low frequency energy ratio.
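The buffering performed by S-P conversion 202 can be pictured with the following Python sketch. It is illustrative only: the function name `frames_of` and the decision to hold back a trailing partial frame are assumptions and are not taken from the description above.

```python
from typing import Iterable, Iterator, List

N = 80  # samples per frame, i.e. 10 ms at 8,000 samples/sec


def frames_of(samples: Iterable[float], n: int = N) -> Iterator[List[float]]:
    """Buffer a stream of samples into consecutive frames of n samples.

    Assumption: a trailing partial frame is held back and never emitted.
    """
    buf: List[float] = []
    for s in samples:
        buf.append(s)
        if len(buf) == n:
            yield buf
            buf = []
```

Each emitted frame would then be passed to the high-pass filter and on to the feature-extraction stage.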

[0018] Periodicity

[0019] The periodicity calculation involves first estimating a pitch period (T) 206 of the speech signal. Pitch-period estimation is known in speech processing. The illustrative method used here may be found in L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, N.J. (1978), pp. 149-150. The pitch period is taken as the value of T that minimizes the average magnitude difference function: $S(T) = \frac{1}{T}\sum_{n=0}^{T-1}\left| x[n] - x[n-T] \right|$

[0020] where x[n], n = 0, 1, . . . , N−1, is the input signal to the pitch period 206 calculation. This is computed for T = T_min, T_min+1, . . . , T_max. The constants T_min and T_max are the lower and upper limits of the pitch period, respectively. The values chosen here are 19 and 80. The value that minimizes the above function is represented as T_opt. After finding T_opt, a periodicity (C) 208 is illustratively computed in a similar way to computation of the pitch prediction filter parameters used in speech codecs and detailed in R. A. Salami et al., “Speech Coding”, Mobile Radio Communications, R. Steele (ed.), Pentech Press, London (1992) pp. 245-253. A gain value (A) is computed as: $A = \frac{\sum_{n=0}^{T_{opt}-1} x[n]\,x[n-T_{opt}]}{\sum_{n=0}^{T_{opt}-1}\left(x[n-T_{opt}]\right)^{2}}$

[0021] The periodicity C is then given by: $C = \frac{\sum_{n=0}^{T_{opt}-1}\left(x[n] - A\,x[n-T_{opt}]\right)^{2}}{\sum_{n=0}^{T_{opt}-1}\left(x[n-T_{opt}]\right)^{2}}$

[0022] When the signal is fully periodic, C is 0. Conversely, when the signal is random, C is 1.
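The pitch-period search, gain, and periodicity computations can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `periodicity` is hypothetical, every sum is taken over one pitch period (n = 0 to T−1), and samples x[n−T] with n < T are assumed to come from the preceding frame, which the description does not state explicitly.

```python
from typing import Sequence, Tuple

T_MIN, T_MAX = 19, 80  # pitch-period search limits given in the text


def periodicity(prev: Sequence[float],
                frame: Sequence[float]) -> Tuple[int, float, float]:
    """Return (T_opt, A, C) for the current frame.

    Assumption: x[n - T] for n < T is read from `prev`, the preceding
    frame, so both `prev` and `frame` must hold at least T_MAX samples.
    """
    if len(prev) < T_MAX or len(frame) < T_MAX:
        raise ValueError("prev and frame must each hold at least T_MAX samples")
    buf = list(prev) + list(frame)
    offset = len(prev)  # buf[offset + n] is x[n]

    def x(n: int) -> float:
        return buf[offset + n]

    def amdf(t: int) -> float:
        # Average magnitude difference function S(T) over one period.
        return sum(abs(x(n) - x(n - t)) for n in range(t)) / t

    t_opt = min(range(T_MIN, T_MAX + 1), key=amdf)

    num = sum(x(n) * x(n - t_opt) for n in range(t_opt))
    den = sum(x(n - t_opt) ** 2 for n in range(t_opt))
    if den == 0:
        # Assumption: an all-zero history is treated as non-periodic.
        return t_opt, 0.0, 1.0
    a = num / den
    c = sum((x(n) - a * x(n - t_opt)) ** 2 for n in range(t_opt)) / den
    return t_opt, a, c
```

For a signal that repeats exactly every 50 samples, the search returns T_opt = 50, a gain A of 1, and a periodicity C of 0, in line with the fully periodic case described above.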

[0023] Total Voiceband Energy

[0024] The total voiceband energy (E_f) 214 is computed for the voiceband frequency range from 100 Hz to 4000 Hz. The total voiceband energy in decibels is given by: $E_{f} = 10\log_{10}\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]^{2}\right]$

[0025] where x[n], n = 0, 1, . . . , N−1, is the input signal to the total voiceband energy 214 calculation.
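The energy formula above can be illustrated with a short Python sketch. The function name `energy_db` and the `floor` parameter, which guards against taking the logarithm of zero on an all-zero frame, are assumptions added for illustration.

```python
import math
from typing import Sequence


def energy_db(x: Sequence[float], floor: float = 1e-12) -> float:
    """Mean-square energy of a frame in dB: 10*log10((1/N) * sum x[n]^2).

    Assumption: `floor` replaces a zero mean-square value so that the
    logarithm stays defined for a silent frame.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    return 10.0 * math.log10(max(mean_sq, floor))
```

As a worked example, a frame whose samples all have amplitude 10 has a mean-square value of 100, which corresponds to 20 dB.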

[0026] High-Low Frequency Energy Ratio

[0027] Energy ratio (E_r) 224 is computed as the ratio of energy above 2400 Hz to the energy below 2400 Hz in the input voiceband signal. To obtain the high-frequency signal, the output of high-pass filter 204 is passed through a second high-pass filter 220 that has a cut-off frequency of 2400 Hz. The energy in decibels of the high-frequency signal is given by: $E_{h} = 10\log_{10}\left[\frac{1}{N}\sum_{n=0}^{N-1} x_{h}[n]^{2}\right]$

[0028] where x_h[n] is the signal output by high-pass filter 220. The high-low energy ratio (E_r) 224 is then given by: $E_{r} = \frac{E_{h}}{E_{f} - E_{h}}$

[0029] where E_f is the total voiceband energy 214.
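As a small illustration, the ratio can be computed in Python directly from the two energies. The sketch applies the formula to the decibel values exactly as printed; the function name `energy_ratio` and the handling of a zero denominator are assumptions.

```python
def energy_ratio(e_f: float, e_h: float) -> float:
    """High-low energy ratio E_r = E_h / (E_f - E_h), as printed above.

    e_f is the total voiceband energy and e_h the high-frequency energy,
    both in dB. Assumption: a zero denominator is reported as an error
    because the text does not say how that case is handled.
    """
    denom = e_f - e_h
    if denom == 0:
        raise ValueError("E_f equals E_h; E_r is undefined for this frame")
    return e_h / denom
```

For instance, with E_f = 30 and E_h = 20 the formula gives E_r = 20 / (30 − 20) = 2.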

[0030] To make the algorithm operate automatically, initial values of the parameters E_f, E_r, and C are computed for the first N_i frames that enter VAD 100 following initialization. Here N_i has been chosen as 32. During this stage of computation, the minimum value of E_f is computed and is denoted as E_min. For every subsequent frame, running averages 212, 218, 228 are used together with smoothing of the parameters to make the algorithm less sensitive to local fluctuations. For the total voiceband energy and the energy ratio, differences 216 and 226, respectively, between the smoothed frame values and the running averages are computed. These are denoted by ΔE_f and ΔE_r. The minimum energy value E_min is also updated, illustratively every 20 frames.
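The description does not give the smoothing or averaging formulas, so the following Python sketch fills them in with clearly hypothetical choices: first-order recursive smoothing and averaging with coefficients `ALPHA_SMOOTH` and `ALPHA_AVG`, an initial average equal to the mean of the first 32 frames, an E_min update that takes the minimum smoothed energy over the last 20 frames, and a sign convention of average minus smoothed value for ΔE_f. Only the constants 32 and 20 come from the text. The class name `EnergyTracker` is likewise an assumption.

```python
from typing import List, Optional

N_INIT = 32          # initialization frames (from the text)
MIN_UPDATE = 20      # E_min update interval in frames (from the text)
ALPHA_SMOOTH = 0.5   # hypothetical smoothing coefficient
ALPHA_AVG = 0.95     # hypothetical running-average coefficient


class EnergyTracker:
    """Tracks smoothed E_f, its running average, and E_min (all in dB)."""

    def __init__(self) -> None:
        self.init: List[float] = []
        self.recent: List[float] = []
        self.smooth = 0.0
        self.avg = 0.0
        self.e_min = 0.0

    def update(self, e_f: float) -> Optional[float]:
        """Feed one frame energy; return delta E_f, or None while initializing."""
        if len(self.init) < N_INIT:
            self.init.append(e_f)
            if len(self.init) == N_INIT:
                # Assumption: initial values are the mean and minimum
                # of the initialization frames.
                self.smooth = self.avg = sum(self.init) / N_INIT
                self.e_min = min(self.init)
            return None
        self.smooth = ALPHA_SMOOTH * self.smooth + (1 - ALPHA_SMOOTH) * e_f
        self.avg = ALPHA_AVG * self.avg + (1 - ALPHA_AVG) * self.smooth
        self.recent.append(self.smooth)
        if len(self.recent) == MIN_UPDATE:
            # Assumption: E_min becomes the minimum over the last 20 frames.
            self.e_min = min(self.recent)
            self.recent = []
        # Assumption: delta is average minus smoothed value, so a frame
        # louder than the average yields a negative delta.
        return self.avg - self.smooth
```

A tracker for the energy ratio E_r could follow the same pattern without the minimum tracking.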

[0031] After feature extraction, a comparison of the parameters is made with several thresholds to generate an initial VAD decision (I_VAD) at thresholds comparison block 230. The procedure for this is illustrated in the flowchart of FIG. 3. Essentially, four different comparisons are made based on the smoothed periodicity C_s, energy difference ΔE_f, and energy-ratio difference ΔE_r. Comparisons 304 and 306 are for detecting voiced/periodic portions of speech. Comparisons 310 and 312 are for detecting unvoiced/random portions of speech.

[0032] Thresholds comparison 230 is performed anew for every frame processed by VAD 100. Upon startup of thresholds comparison 230, at step 300 of FIG. 3, the value of I_VAD is initialized to zero, at step 302. A set of four comparisons is then made at steps 304, 306, 310, and 312. A comparison is made at step 304 to determine if ΔE_f < −7 dB and C_s < 0.5; if so, voiced speech has been detected, as indicated at step 308; if not, speech has not been detected, as indicated at step 318. A comparison is made at step 306 to determine if C_s < 0.15; if so, voiced speech has been detected, as indicated at step 308; if not, speech has not been detected, as indicated at step 318. A comparison is made at step 310 to determine if ΔE_r < −10; if so, unvoiced speech has been detected, as indicated at step 314; if not, speech has not been detected, as indicated at step 320. A comparison is made at step 312 to determine if ΔE_r > 10; if so, unvoiced speech has been detected, as indicated at step 314; if not, speech has not been detected, as indicated at step 320. If speech has been detected by any one or more of the comparisons 304, 306, 310, and 312, the value of I_VAD is set to one, at step 316; if speech has not been detected by any of the comparisons, the value of I_VAD remains zero. Thresholds comparison block 230 then ends, at step 322.
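The four comparisons of FIG. 3 reduce to a short decision function. The following Python sketch uses the threshold values stated above; the function name `initial_vad` and its parameter names are assumptions introduced for illustration.

```python
def initial_vad(delta_e_f: float, c_s: float, delta_e_r: float) -> int:
    """Return I_VAD for one frame: 1 if speech is detected, else 0.

    delta_e_f: energy difference in dB
    c_s:       smoothed periodicity (0 = fully periodic, 1 = random)
    delta_e_r: energy-ratio difference
    """
    # Steps 304 and 306: voiced/periodic portions of speech.
    voiced = (delta_e_f < -7.0 and c_s < 0.5) or c_s < 0.15
    # Steps 310 and 312: unvoiced/random portions of speech.
    unvoiced = delta_e_r < -10.0 or delta_e_r > 10.0
    # Step 316: any one positive comparison sets I_VAD to one.
    return 1 if (voiced or unvoiced) else 0
```

Note that a large energy difference alone does not set I_VAD: with ΔE_f = −8 dB but C_s = 0.6 and ΔE_r = 0, the function returns 0, which reflects the statement in the Summary that no initial decision rests on total energy alone.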

[0033] After thresholds comparison 230 has been made to determine the value of I_VAD, a final output decision is made at block 232. A flowchart describing this block is shown in FIG. 4. Output decision 232 is performed anew for every value of I_VAD produced by thresholds comparison 230.

[0034] Upon startup of VAD 100, the values of a holdover flag H_VAD and a final VAD flag F_VAD are initialized to zero, at step 400. Upon receipt of an I_VAD value from block 230, at step 402, output decision 232 checks whether the received value of I_VAD is one, at step 404. If so, it means that speech has been detected, as indicated at step 406. Output decision 232 therefore sets H_VAD to one, at step 408, and sets F_VAD to one, at step 418. The value of F_VAD constitutes output 234 of VAD 100. If the value of I_VAD is found to be zero at step 404, speech has not been detected, as indicated at step 409. However, output decision 232 checks if the value of H_VAD is set to one from a previous frame, at step 410. If so, output decision 232 further checks if the smoothed value of E_f less the value of E_min is greater than 8 dB, at step 412. If so, holdover is indicated, at step 414, and so output decision 232 maintains F_VAD set to one, at step 418, even though speech has not been detected. If the value of H_VAD is found to be zero at step 410, or if the difference between the smoothed energy and the minimum energy computed at step 412 has fallen to less than 8 dB, speech is not detected and there is no hold-over, as indicated at step 415. Output decision 232 therefore sets the values of H_VAD and F_VAD to zero, at step 416. Following step 416 or 418, output decision 232 ends its operation, at step 420, until the next I_VAD value is received at step 402.
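The holdover logic of FIG. 4 can be sketched as a small state machine in Python. The class name `OutputDecision` and the method and parameter names are assumptions; the 8 dB threshold and the flag behavior follow the description above.

```python
class OutputDecision:
    """Holds H_VAD and F_VAD across frames, following FIG. 4."""

    HOLDOVER_DB = 8.0  # threshold on smoothed E_f minus E_min (step 412)

    def __init__(self) -> None:
        # Step 400: both flags start at zero.
        self.h_vad = 0
        self.f_vad = 0

    def step(self, i_vad: int, e_f_smoothed: float, e_min: float) -> int:
        """Process one I_VAD value and return F_VAD (output 234)."""
        if i_vad == 1:
            # Steps 404-408 and 418: speech detected in this frame.
            self.h_vad = 1
            self.f_vad = 1
        elif self.h_vad == 1 and (e_f_smoothed - e_min) > self.HOLDOVER_DB:
            # Steps 410-414 and 418: holdover keeps the output at one.
            self.f_vad = 1
        else:
            # Steps 415-416: no speech and no holdover.
            self.h_vad = 0
            self.f_vad = 0
        return self.f_vad
```

Because H_VAD is cleared at step 416, holdover cannot restart on energy alone; a later frame with high energy yields F_VAD = 1 again only after thresholds comparison 230 has produced I_VAD = 1.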

[0035] Of course, various changes and modifications to the illustrative embodiment described above will be apparent to those skilled in the art. For example, the noise-energy filter may be dispensed with. A different value may be used for the high/low frequency threshold. Sampling of the input signal may be effected at a different rate, especially at higher rates. The uppermost frequency of the voiceband is then correspondingly increased. The holdover may be dispensed with and the initial VAD output I_VAD may be used as the final VAD output. A different procedure may be used to estimate the pitch period, or the combined threshold comparison of the energy and periodicity may be replaced with a single energy threshold comparison. Such changes and modifications can be made without departing from the spirit and the scope of the invention and without diminishing its attendant advantages. It is therefore intended that such changes and modifications be covered by the following claims except insofar as limited by the prior art.

What is claimed is:
 1. A method of voice activity detection comprising: determining a difference between (a) an average ratio of energy above a first threshold frequency in a signal comprising multiple frequencies and energy below the first threshold frequency in the signal and (b) a present ratio of energy above the first threshold frequency in the signal and energy below the first threshold frequency in the signal; and in response to the difference either being exceeded by a first threshold value or exceeding a second threshold value greater than the first threshold value, indicating that the signal includes a voice signal.
 2. The method of claim 1 wherein: the first threshold frequency is about 2400 Hz.
 3. The method of claim 1 further comprising: prior to the determining, removing noise energy from the signal.
 4. The method of claim 3 wherein: removing comprises filtering out from the signal frequencies below a second threshold frequency lower than the first threshold frequency.
 5. The method of claim 4 wherein: the second threshold frequency is about 100 Hz.
 6. The method of claim 1 further comprising: repeating the steps for successive segments of the signal.
 7. The method of claim 1 further comprising: determining an average periodicity of the signal; and in response to the average periodicity of the signal being lower than a third threshold value, indicating that the signal includes a voice signal.
 8. The method of claim 7 wherein: determining an average periodicity comprises estimating a pitch period of the signal; determining a gain value of the signal over the pitch period as a function of the estimated pitch period; determining a periodicity of the signal over the pitch period as a function of the estimated pitch period and the gain value; and averaging the determined periodicity with previously-determined at least one said determined periodicity.
 9. The method of claim 7 further comprising: repeating the steps for successive segments of the signal.
 10. The method of claim 7 further comprising: determining a difference between average total energy in the signal and present total energy in the signal; and in response to the difference between the average total energy and the present total energy being lower than a fourth threshold value and the average periodicity of the signal being lower than a fifth threshold value, indicating that the signal includes a voice signal.
 11. The method of claim 10 further comprising: prior to determining the difference between the average total energy and the present total energy, removing noise energy from the signal.
 12. The method of claim 1 wherein: determining a difference between the average total energy and the present total energy comprises determining a difference between average total energy in a voiceband of the signal and present total energy in the voiceband.
 13. The method of claim 12 wherein: the voiceband extends from about 100 Hz to about 4000 Hz.
 14. The method of claim 10 further comprising: repeating the steps for successive segments of the signal.
 15. The method of claim 14 further comprising: in response to not indicating for a present segment of the signal that the signal includes a voice signal, and indicating for a segment of the signal preceding the present segment that the signal includes a voice signal, determining if the average total energy of the signal exceeds a minimum average total energy of the signal by a sixth threshold value; and in response to the average total energy exceeding the minimum average total energy by the sixth threshold value, indicating that the signal includes a voice signal.
 16. An apparatus that performs the method of any one of the claims 1-15.
 17. A computer-readable medium containing executable instructions which, when executed in a computer, cause the computer to perform the method of any one of the claims 1-15.
 18. An apparatus for detecting voice activity comprising: means for determining an average ratio of energy above a first threshold frequency in a signal comprising multiple frequencies and energy below the first threshold frequency in the signal; means for determining a present ratio of energy above the first threshold frequency in the signal and energy below the first threshold frequency in the signal; means for determining a difference between the average ratio and the present ratio; and means cooperative with the means for determining a difference and responsive to the difference either being exceeded by a first threshold value or exceeding a second threshold value greater than the first threshold value, for indicating that the signal includes a voice signal.
 19. The apparatus of claim 18 further comprising: means for determining an average periodicity of the signal; and means cooperative with the means for determining an average periodicity and responsive to the average periodicity being lower than a third threshold value, for indicating that the signal includes a voice signal.
 20. The apparatus of claim 19 further comprising: means for determining a difference between average total energy in the signal and present total energy in the signal; and means cooperative with the means for determining a difference between the average total energy and the present total energy and the means for determining an average periodicity and responsive to the difference between the average total energy and the present total energy being lower than a fourth threshold value and the average periodicity of the signal being lower than the fifth threshold value, for indicating that the signal includes a voice signal.
 21. The apparatus of claim 20 for detecting voice activity in successive segments of the signal, further comprising: means responsive to a lack of indication for a present segment of the signal that the signal includes a voice signal and to an indication for a segment of the signal preceding the present segment that the signal includes a voice signal, for determining if the average total energy of the signal exceeds a minimum average total energy of the signal by a sixth threshold value; and means cooperative with the means for determining if the average total energy exceeds the minimum average total energy and responsive to the average total energy exceeding the minimum average total energy by the sixth threshold value, for indicating that the signal includes a voice signal.